Lexical Normalization of Spanish Tweets with Rule-Based Components and Language Models
نویسندگان
چکیده
This paper explores three different methods of learning to map variant word form (dialectal or diachronic) to standard ones from a limited parallel corpus of standard and variant texts, given that a computational description of the standard morphology is available.
منابع مشابه
Lexical Normalization of Spanish Tweets with Preprocessing Rules, Domain-specific Edit Distances, and Language Models
We present a system to normalize Spanish tweets, which uses preprocessing rules, a domain-appropriate edit-distance model, and language models to select correction candidates based on context. The system’s results at SEPLN 2013 Tweet-Norm task were above-average.
متن کاملTweetNorm: a benchmark for lexical normalization of Spanish tweets
The language used in social media is often characterized by the abundance of informal and non-standard writing. The normalization of this non-standard language can be crucial to facilitate the subsequent textual processing and to consequently help boost the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalizati...
متن کاملDLSI en Tweet-Norm 2013: Normalización de Tweets en Español
The lexical richness and its ease of access to large volumes of information converts the Web 2.0 into an important resource for Natural Language Processing. Nevertheless, the frequent presence of non-normative linguistic phenomena that can make any automatic processing challenging. In this paper is described the participation in the Text Normalisation Workshop at the SEPLN conference (Tweet-nor...
متن کاملPrototipado Rápido de un Sistema de Normalización de Tuits: Una Aproximación Léxica
This work describes the system for the normalization of tweets in Spanish designed by the Language in the Information Society (LYS) Group of the University of A Coruña for Tweet-Norm 2013. It is a conceptually simple and flexible system, which uses few resources and that faces the problem from a lexical point of view.
متن کاملNormalizing tweets with edit scripts and recurrent neural embeddings
Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and other non-canonical language. These features are problematic for standard language analysis tools and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlab...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Procesamiento del Lenguaje Natural
دوره 52 شماره
صفحات -
تاریخ انتشار 2014